Computation and Language
☆ Efficient Induction of Language Models Via Probabilistic Concept Formation
This paper presents a novel approach to the acquisition of language models
from corpora. The framework builds on Cobweb, an early system for constructing
taxonomic hierarchies of probabilistic concepts that used a tabular,
attribute-value encoding of training cases and concepts, making it unsuitable
for sequential input like language. In response, we explore three new
extensions to Cobweb -- the Word, Leaf, and Path variants. These systems encode
each training case as an anchor word and surrounding context words, and they
store probabilistic descriptions of concepts as distributions over anchor and
context information. As in the original Cobweb, a performance element sorts a
new instance downward through the hierarchy and uses the final node to predict
missing features. Learning is interleaved with performance, updating concept
probabilities and hierarchy structure as classification occurs. Thus, the new
approaches process training cases in an incremental, online manner that is very
different from most methods for statistical language learning. We examine how
well the three variants place synonyms together and keep homonyms apart, their
ability to recall synonyms as a function of training set size, and their
training efficiency. Finally, we discuss related work on incremental learning
and directions for further research.
comment: 18 pages, 5 figures, Presented at Advances in Cognitive Systems 2022
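The anchor-plus-context encoding and the incremental concept updates described in this abstract can be sketched as follows. This is a minimal illustration with a single concept node (the actual systems maintain a full taxonomic hierarchy), and the window size and all names are our own, not the paper's:

```python
from collections import Counter

def encode_case(tokens, i, window=2):
    """Encode token i as an anchor word plus surrounding context words,
    in the spirit of the Word/Leaf/Path variants (window size is an
    illustrative choice, not the paper's setting)."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return {"anchor": tokens[i], "context": Counter(left + right)}

class Concept:
    """A probabilistic concept storing distributions over anchor and
    context words, updated incrementally as instances are sorted to it."""
    def __init__(self):
        self.n = 0
        self.anchor = Counter()
        self.context = Counter()

    def update(self, case):
        # learning interleaved with performance: counts change as
        # classification occurs
        self.n += 1
        self.anchor[case["anchor"]] += 1
        self.context.update(case["context"])

    def p_anchor(self, word):
        # probability estimate used to predict a missing anchor word
        return self.anchor[word] / self.n if self.n else 0.0

tokens = "the cat sat on the mat".split()
root = Concept()
for i in range(len(tokens)):
    root.update(encode_case(tokens, i))
print(root.p_anchor("the"))  # 2 of 6 instances -> 0.333...
```

Because each `update` touches only counts, training remains online: an instance can be classified and absorbed in a single pass, with no batch re-estimation.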
☆ Multilingual News Location Detection using an Entity-Based Siamese Network with Semi-Supervised Contrastive Learning and Knowledge Base
Early detection of relevant locations in a piece of news is especially
important in extreme events such as environmental disasters, war conflicts,
disease outbreaks, or political turmoils. Additionally, this detection also
helps recommender systems to promote relevant news based on user locations.
Note that, when the relevant locations are not mentioned explicitly in the
text, state-of-the-art methods typically fail to recognize them because these
methods rely on syntactic recognition. In contrast, by incorporating a
knowledge base and connecting entities with their locations, our system
successfully infers the relevant locations even when they are not mentioned
explicitly in the text. To evaluate the effectiveness of our approach, and due
to the lack of datasets in this area, we also contribute to the research
community with a gold-standard multilingual news-location dataset, NewsLOC. It
contains the annotation of the relevant locations (and their WikiData IDs) of
600+ Wikinews articles in five different languages: English, French, German,
Italian, and Spanish. Through experimental evaluations, we show that our
proposed system outperforms the baselines, and that the version fine-tuned
with semi-supervised data further increases the classification rate. The
source code and the NewsLOC dataset are publicly available to the research
community at https://github.com/vsuarezpaniagua/NewsLocation.
☆ Training Integer-Only Deep Recurrent Neural Networks
Recurrent neural networks (RNN) are the backbone of many text and speech
applications. These architectures are typically made up of several
computationally complex components such as non-linear activation functions,
normalization, bi-directional dependence and attention. In order to maintain
good accuracy, these components are frequently run using full-precision
floating-point computation, making them slow, inefficient and difficult to
deploy on edge devices. In addition, the complex nature of these operations
makes them challenging to quantize using standard quantization methods without
a significant performance drop. We present a quantization-aware training method
for obtaining a highly accurate integer-only recurrent neural network (iRNN).
Our approach supports layer normalization, attention, and an adaptive piecewise
linear (PWL) approximation of activation functions, to serve a wide range of
state-of-the-art RNNs. The proposed method enables RNN-based language models to
run on edge devices with $2\times$ improvement in runtime, and $4\times$
reduction in model size, while maintaining accuracy similar to that of their
full-precision counterparts.
comment: arXiv admin note: substantial text overlap with arXiv:2109.09828
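The piecewise-linear (PWL) approximation of activation functions mentioned in this abstract can be illustrated with a small sketch. The knot placement below is hypothetical; the paper's method learns the segments adaptively during quantization-aware training, which this float-valued toy does not model:

```python
import math

KNOTS = tuple(i * 0.5 for i in range(-4, 5))  # hypothetical knot placement

def pwl_tanh(x, knots=KNOTS):
    """Piecewise-linear approximation of tanh by interpolation between
    sample points. Outside the knot range, the value is clamped, which
    mirrors how saturating activations behave under quantization."""
    if x <= knots[0]:
        return math.tanh(knots[0])   # clamp below the first knot
    if x >= knots[-1]:
        return math.tanh(knots[-1])  # clamp above the last knot
    for a, b in zip(knots, knots[1:]):
        if x <= b:
            fa, fb = math.tanh(a), math.tanh(b)
            t = (x - a) / (b - a)
            return fa + t * (fb - fa)

print(abs(pwl_tanh(0.25) - math.tanh(0.25)) < 0.05)  # True
```

A PWL segment needs only a multiply and an add per evaluation, which is why it maps naturally onto integer-only arithmetic once the slopes and intercepts are quantized.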
☆ GENIE: Large Scale Pre-training for Text Generation with Diffusion Model
In this paper, we propose a large-scale language pre-training for text
GENeration using dIffusion modEl, which is named GENIE. GENIE is a pre-training
sequence-to-sequence text generation model which combines Transformer and
diffusion. The diffusion model accepts the latent information from the encoder,
which is used to guide the denoising of the current time step. After multiple
such denoise iterations, the diffusion model can restore the Gaussian noise to
the diverse output text which is controlled by the input text. Moreover, such
architecture design also allows us to adopt large scale pre-training on the
GENIE. We propose a novel pre-training method named continuous paragraph
denoise based on the characteristics of the diffusion model. Extensive
experiments on the XSum, CNN/DailyMail, and Gigaword benchmarks show that
GENIE achieves performance comparable to various strong baselines; in
particular, after pre-training, the generation quality of GENIE is greatly
improved. We have also conducted extensive experiments on the generation
diversity and parameter impact of GENIE. The code for GENIE will be made publicly
available.
comment: Work in progress
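The guided denoising loop described in this abstract — Gaussian noise iteratively restored toward output controlled by the encoder's latent — can be caricatured with a scalar toy. The schedule and the closed-form "denoiser" are our own illustration; GENIE uses a learned Transformer denoiser over text latents:

```python
import random

def denoise_step(x, latent, t):
    """One toy denoising step: move the current state a fraction of the
    way toward the encoder latent that guides generation. The schedule
    alpha = 1/(t+1) is illustrative, not GENIE's."""
    alpha = 1.0 / (t + 1)
    return x + alpha * (latent - x)

def generate(latent, T=50, seed=0):
    random.seed(seed)
    x = random.gauss(0.0, 1.0)    # start from Gaussian noise
    for t in reversed(range(T)):  # multiple denoise iterations
        x = denoise_step(x, latent, t)
    return x

print(generate(latent=3.0))  # converges to the guiding latent, 3.0
```

The point of the sketch is the control flow: different input texts produce different latents, and the same noise initialization is steered toward different outputs, which is how diversity and input control coexist.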
☆ CAMeMBERT: Cascading Assistant-Mediated Multilingual BERT
Large language models having hundreds of millions, and even billions, of
parameters have performed extremely well on a variety of natural language
processing (NLP) tasks. Their widespread use and adoption, however, is hindered
by the lack of availability and portability of sufficiently large computational
resources. This paper proposes a knowledge distillation (KD) technique building
on the work of LightMBERT, a student model of multilingual BERT (mBERT). By
repeatedly distilling mBERT through increasingly compressed top-layer distilled
teacher assistant networks, CAMeMBERT aims to improve upon the time and space
complexities of mBERT while keeping loss of accuracy beneath an acceptable
threshold. At present, CAMeMBERT has an average accuracy of around 60.1%, which
is subject to change after future improvements to the hyperparameters used in
fine-tuning.
comment: 4 pages, 2 figures, 3 tables
☆ Understanding Postpartum Parents' Experiences via Two Digital Platforms
Xuewen Yao, Miriam Mikhelson, Megan Micheletti, Eunsol Choi, S Craig Watkins, Edison Thomaz, Kaya De Barbaro
Digital platforms, including online forums and helplines, have emerged as
avenues of support for caregivers suffering from postpartum mental health
distress. Understanding support seekers' experiences as shared on these
platforms could provide crucial insight into caregivers' needs during this
vulnerable time. In the current work, we provide a descriptive analysis of the
concerns, psychological states, and motivations shared by healthy and
distressed postpartum support seekers on two digital platforms, a one-on-one
digital helpline and a publicly available online forum. Using a combination of
human annotations, dictionary models and unsupervised techniques, we find stark
differences between the experiences of distressed and healthy mothers.
Distressed mothers described interpersonal problems and a lack of support, with
8.60% - 14.56% reporting severe symptoms including suicidal ideation. In
contrast, the majority of healthy mothers described childcare issues, such as
questions about breastfeeding or sleeping, and reported no severe mental health
concerns. Across the two digital platforms, we found that distressed mothers
shared similar content. However, the patterns of speech and affect shared by
distressed mothers differed between the helpline vs. the online forum,
suggesting the design of these platforms may shape meaningful measures of their
support-seeking experiences. Our results provide new insight into the
experiences of caregivers suffering from postpartum mental health distress. We
conclude by discussing methodological considerations for understanding content
shared by support seekers and design considerations for the next generation of
support tools for postpartum parents.
comment: Will be published in PACM HCI, CSCW1, April 2023 issue
♻ ☆ Chatbots in a Botnet World
Question-and-answer formats provide a novel experimental platform for
investigating cybersecurity questions. Unlike previous chatbots, the latest
ChatGPT model from OpenAI supports an advanced understanding of complex coding
questions. The research demonstrates thirteen coding tasks that generally
qualify as stages in the MITRE ATT&CK framework, ranging from credential access
to defense evasion. With varying success, the experimental prompts generate
examples of keyloggers, logic bombs, obfuscated worms, and payment-fulfilled
ransomware. The empirical results illustrate cases that support the broad gain
of functionality, including self-replication and self-modification, evasion,
and strategic understanding of complex cybersecurity goals. One surprising
feature of ChatGPT as a language-only model centers on its ability to spawn
coding approaches that yield images that obfuscate or embed executable
programming steps or links.
♻ ☆ Keyphrase Generation with Cross-Document Attention
Keyphrase generation aims to produce a set of phrases summarizing the
essentials of a given document. Conventional methods normally apply an
encoder-decoder architecture to generate the output keyphrases for an input
document, where they focus only on the current document and thus inevitably
omit crucial corpus-level information carried by other similar documents,
i.e., cross-document dependencies and latent topics. In this
paper, we propose CDKGen, a Transformer-based keyphrase generator, which
expands the Transformer to global attention with cross-document attention
networks to incorporate available documents as references so as to generate
better keyphrases with the guidance of topic information. On top of the
proposed Transformer + cross-document attention architecture, we also adopt a
copy mechanism to enhance our model by selecting appropriate words from
documents to deal with out-of-vocabulary words in keyphrases. Experiment
results on five benchmark datasets illustrate the validity and effectiveness of
our model, which achieves state-of-the-art performance on all datasets.
Further analyses confirm that the proposed model is able to generate keyphrases
consistent with references while keeping sufficient diversity. The code of
CDKGen is available at https://github.com/SVAIGBA/CDKGen.
comment: This paper will be superseded by another improved version with new
approaches, new settings, and new experimental results
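The copy mechanism this abstract adopts for out-of-vocabulary keyphrase words follows the standard pointer/copy formulation, which can be sketched schematically. In the model, the gate p_gen is learned per decoding step; here it is a fixed toy value, and the distributions are hand-written:

```python
def copy_mechanism(p_vocab, p_copy_attn, p_gen):
    """Mix the vocabulary distribution with the copy (attention)
    distribution over source-document words, gated by p_gen. Words that
    appear only in the document can still receive probability mass."""
    words = set(p_vocab) | set(p_copy_attn)
    return {w: p_gen * p_vocab.get(w, 0.0)
               + (1 - p_gen) * p_copy_attn.get(w, 0.0)
            for w in words}

# "transformer" is out-of-vocabulary but present in the source document
p_vocab = {"model": 0.7, "neural": 0.3}
p_copy = {"transformer": 0.9, "model": 0.1}
out = copy_mechanism(p_vocab, p_copy, p_gen=0.5)
print(max(out, key=out.get))  # transformer
```

Since both inputs are probability distributions and the gate is a convex weight, the mixture is itself a valid distribution — no renormalization is needed.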
♻ ☆ Towards mapping the contemporary art world with ArtLM: an art-specific NLP model
Qinkai Chen, Mohamed El-Mennaoui, Antoine Fosset, Amine Rebei, Haoyang Cao, Philine Bouscasse, Christy Eóin O'Beirne, Sasha Shevchenko, Mathieu Rosenbaum
With an increasing amount of data in the art world, discovering artists and
artworks suitable to collectors' tastes becomes a challenge. It is no longer
enough to use visual information, as contextual information about the artist
has become just as important in contemporary art. In this work, we present a
generic Natural Language Processing framework (called ArtLM) to discover the
connections among contemporary artists based on their biographies. In this
approach, we first continue pre-training existing general English language
models with a large amount of unlabelled art-related data. We then fine-tune
this new pre-trained model with our biography pair dataset manually annotated
by a team of professionals in the art industry. With extensive experiments, we
demonstrate that our ArtLM achieves 85.6% accuracy and 84.0% F1 score and
outperforms other baseline models. We also provide a visualisation and a
qualitative analysis of the artist network built from ArtLM's outputs.
♻ ☆ Compositional generalization in semantic parsing with pretrained transformers
Large-scale pretraining instills large amounts of knowledge in deep neural
networks. This, in turn, improves the generalization behavior of these models
in downstream tasks. What exactly are the limits to the generalization benefits
of large-scale pretraining? Here, we report observations from some simple
experiments aimed at addressing this question in the context of two semantic
parsing tasks involving natural language, SCAN and COGS. We show that language
models pretrained exclusively with non-English corpora, or even with
programming language corpora, significantly improve out-of-distribution
generalization in these benchmarks, compared with models trained from scratch,
even though both benchmarks are English-based. This demonstrates the
surprisingly broad transferability of pretrained representations and knowledge.
Pretraining with a large-scale protein sequence prediction task, on the other
hand, mostly deteriorates the generalization performance in SCAN and COGS,
suggesting that pretrained representations do not transfer universally and that
there are constraints on the similarity between the pretraining and downstream
domains for successful transfer. Finally, we show that larger models are harder
to train from scratch and their generalization accuracy is lower when trained
up to convergence on the relatively small SCAN and COGS datasets, but the
benefits of large-scale pretraining become much clearer with larger models.
comment: v3 adds one reference, adds further discussion, slightly changes
formatting
♻ ☆ SimpleStyle: An Adaptable Style Transfer Approach
Attribute-controlled text rewriting, also known as text style-transfer, has a
crucial role in regulating attributes and biases of textual training data and a
machine generated text. In this work we present SimpleStyle, a minimalist yet
effective approach for style-transfer composed of two simple ingredients:
controlled denoising and output filtering. Despite the simplicity of our
approach, which can be succinctly described with a few lines of code, it is
competitive with previous state-of-the-art methods both in automatic and in
human evaluation. To demonstrate the adaptability and practical value of our
system beyond academic data, we apply SimpleStyle to transfer a wide range of
text attributes appearing in real-world textual data from social networks.
Additionally, we introduce a novel "soft noising" technique that further
improves the performance of our system. We also show that teaching a student
model to generate the output of SimpleStyle can result in a system that
performs style transfer of equivalent quality with only a single greedy-decoded
sample. Finally, we suggest our method as a remedy for the fundamental
incompatible baseline issue that holds progress in the field. We offer our
protocol as a simple yet strong baseline for works that wish to make
incremental advancements in the field of attribute controlled text rewriting.
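The output-filtering ingredient of a SimpleStyle-like pipeline can be sketched in a few lines, matching the abstract's claim that the approach is succinct. The controlled-denoising stage that produces the candidates is not modelled here, and the classifier below is a toy stand-in for a trained attribute model:

```python
def style_transfer(source, candidates, classifier, target_attr):
    """Output filtering: among candidate rewrites (produced upstream by
    controlled denoising), keep only those the attribute classifier
    assigns to the target style; fall back to the input if none survive."""
    kept = [c for c in candidates if classifier(c) == target_attr]
    return kept[0] if kept else source

toy_classifier = lambda s: "positive" if "great" in s else "negative"
candidates = ["the food was bad", "the food was great"]
print(style_transfer("the food was bad", candidates,
                     toy_classifier, "positive"))  # the food was great
```

Filtering decouples fluency (handled by the generator) from attribute correctness (handled by the classifier), which is what makes the recipe adaptable to new attributes.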
♻ ☆ NarrativeTime: Dense Temporal Annotation on a Timeline
For the past decade, temporal annotation has been sparse: only a small
portion of event pairs in a text was annotated. We present NarrativeTime, the
first timeline-based annotation framework that achieves full coverage of all
possible TLinks. To compare with the previous SOTA in dense temporal
annotation, we perform a full re-annotation of the TimeBankDense corpus, which shows
comparable agreement with a significant increase in density. We contribute
TimeBankNT corpus (with each text fully annotated by two expert annotators),
extensive annotation guidelines, open-source tools for annotation and
conversion to TimeML format, baseline results, as well as quantitative and
qualitative analysis of inter-annotator agreement.
♻ ☆ Better Transcription of UK Supreme Court Hearings
Transcription of legal proceedings is very important to enable access to
justice. However, speech transcription is an expensive and slow process. In
this paper we describe part of a combined research and industrial project for
building an automated transcription tool designed specifically for the Justice
sector in the UK. We explain the challenges involved in transcribing court room
hearings and the Natural Language Processing (NLP) techniques we employ to
tackle these challenges. We show that fine-tuning a generic off-the-shelf
pre-trained Automatic Speech Recognition (ASR) system with an in-domain
language model, as well as infusing common phrases extracted with a
collocation detection model, not only improves the Word Error Rate (WER) of
the transcribed hearings but also avoids critical errors specific to the
legal jargon and terminology commonly used in British courts.
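Collocation detection of the kind this abstract mentions is commonly done by scoring word pairs with pointwise mutual information (PMI); the sketch below shows that standard technique on a toy corpus, not the project's actual model:

```python
import math
from collections import Counter

def pmi_collocations(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information:
    log of how much more often the pair co-occurs than chance predicts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (a, b), c in bigrams.items():
        if c >= min_count:  # ignore rare pairs
            p_ab = c / (n - 1)
            p_a, p_b = unigrams[a] / n, unigrams[b] / n
            scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores

tokens = "my lord the court will rise my lord the court is adjourned".split()
scores = pmi_collocations(tokens)
print(("the", "court") in scores)  # True: a recurring legal phrase
```

High-PMI phrases such as recurring legal formulae can then be injected into the ASR language model so the recognizer prefers them over acoustically similar but implausible alternatives.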
♻ ☆ InvBERT: Reconstructing Text from Contextualized Word Embeddings by inverting the BERT pipeline
Digital Humanities and Computational Literary Studies apply text mining
methods to investigate literature. Such automated approaches enable
quantitative studies on large corpora which would not be feasible by manual
inspection alone. However, due to copyright restrictions, the availability of
relevant digitized literary works is limited. Derived Text Formats (DTFs) have
been proposed as a solution. Here, textual materials are transformed in such a
way that copyright-critical features are removed, but that the use of certain
analytical methods remains possible. Contextualized word embeddings produced by
transformer-encoders (like BERT) are promising candidates for DTFs because they
allow for state-of-the-art performance on various analytical tasks and, at
first sight, do not disclose the original text. However, in this paper we
demonstrate that under certain conditions the reconstruction of the original
copyrighted text becomes feasible and its publication in the form of
contextualized token representations is not safe. Our attempts to invert BERT
suggest that publishing the encoder as a black box together with the
contextualized embeddings is critical, since it makes it possible to generate
data to train a decoder with a reconstruction accuracy sufficient to violate
copyright laws.
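The attack structure this abstract describes — query the black-box encoder to build (embedding, token) training pairs, then learn a decoder — can be sketched with a toy. The hash-based "encoder" and nearest-neighbour "decoder" are our own stand-ins; real contextualized embeddings are high-dimensional and context-dependent, and InvBERT trains a neural decoder:

```python
def black_box_encoder(token):
    """Stand-in for the published encoder: a deterministic embedding per
    token (this toy ignores context entirely)."""
    h = sum(ord(c) for c in token)
    return (h % 7, h % 11)

# Attacker step 1: query the black box on a vocabulary of their choosing
vocab = ["the", "court", "said", "nothing"]
train = [(black_box_encoder(w), w) for w in vocab]

# Attacker step 2: "train" a decoder; here a nearest-neighbour lookup
def decode(embedding):
    return min(train, key=lambda pair: sum((a - b) ** 2
               for a, b in zip(pair[0], embedding)))[1]

# Published embeddings can now be mapped back to the original tokens
published = [black_box_encoder(w) for w in ["court", "said"]]
print([decode(e) for e in published])  # ['court', 'said']
```

The toy makes the threat model concrete: the attacker never needs the encoder's weights, only the ability to call it, because the encoder itself manufactures unlimited supervised training data for the inverter.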
♻ ☆ Detect-Localize-Repair: A Unified Framework for Learning to Debug with CodeT5 EMNLP 2022
Automated software debugging is a crucial task for improving the productivity
of software developers. Many neural-based techniques have been proven effective
for debugging-related tasks such as bug localization and program repair (or bug
fixing). However, these techniques often focus only on either one of them or
approach them in a stage-wise manner, ignoring the mutual benefits between
them. In this work, we propose a novel unified \emph{Detect-Localize-Repair}
framework based on a pretrained programming language model CodeT5 to seamlessly
address these tasks, named CodeT5-DLR. Specifically, we propose three
objectives to adapt the generic CodeT5 for debugging: a bug detection objective
to determine whether a given code snippet is buggy or not, a bug localization
objective to identify the buggy lines, and a program repair objective to
translate the buggy code to its fixed version. We evaluate it on each of these
tasks and their combined setting on two newly collected line-level debugging
datasets in Java and Python. Extensive results show that our model
significantly outperforms existing baselines from both NLP and software
engineering domains.
comment: Accepted to EMNLP 2022 Findings Track
♻ ☆ Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors
A method to perform offline and online speaker diarization for an unlimited
number of speakers is described in this paper. End-to-end neural diarization
(EEND) has achieved overlap-aware speaker diarization by formulating it as a
multi-label classification problem. It has also been extended for a flexible
number of speakers by introducing speaker-wise attractors. However, the output
number of speakers of attractor-based EEND is empirically capped; it cannot
deal with cases where the number of speakers appearing during inference is
higher than that during training because its speaker counting is trained in a
fully supervised manner. Our method, EEND-GLA, solves this problem by
introducing unsupervised clustering into attractor-based EEND. In the method,
the input audio is first divided into short blocks, then attractor-based
diarization is performed for each block, and finally, the results of each block
are clustered on the basis of the similarity between locally-calculated
attractors. While the number of output speakers is limited within each block,
the total number of speakers estimated for the entire input can be higher than
the limitation. To use EEND-GLA in an online manner, our method also extends
the speaker-tracing buffer, which was originally proposed to enable online
inference of conventional EEND. We introduce a block-wise buffer update to make
the speaker-tracing buffer compatible with EEND-GLA. Finally, to improve online
diarization, our method improves the buffer update method and revisits the
variable chunk-size training of EEND. The experimental results demonstrate that
EEND-GLA can perform speaker diarization of an unseen number of speakers in
both offline and online inferences.
comment: Accepted to IEEE/ACM TASLP
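The cross-block clustering of locally-calculated attractors described in this abstract can be sketched with a greedy cosine-similarity scheme. This is a simplified stand-in for EEND-GLA's unsupervised clustering, with an illustrative threshold and hand-written 2-d attractors:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def link_speakers(block_attractors, threshold=0.9):
    """Greedily cluster attractors across blocks: an attractor joins an
    existing global speaker if it is similar enough, otherwise it opens
    a new one, so the global speaker count can exceed the per-block cap."""
    global_speakers = []  # one representative attractor per speaker
    labels = []
    for block in block_attractors:
        block_labels = []
        for att in block:
            sims = [cosine(att, g) for g in global_speakers]
            if sims and max(sims) >= threshold:
                block_labels.append(sims.index(max(sims)))
            else:
                global_speakers.append(att)
                block_labels.append(len(global_speakers) - 1)
        labels.append(block_labels)
    return labels

# two blocks with two local speakers each: speaker 0 reappears in block 2,
# and a third speaker emerges even though each block saw at most two
blocks = [[(1.0, 0.0), (0.0, 1.0)], [(0.99, 0.05), (0.5, 0.5)]]
print(link_speakers(blocks))  # [[0, 1], [0, 2]]
```

The sketch shows the key property: the number of output speakers is limited within each block, but the clustering lets the total estimated for the entire input grow beyond that limit.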
♻ ☆ Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models
In the era of digital healthcare, the huge volumes of textual information
generated every day in hospitals constitute an essential but underused asset
that could be exploited with task-specific, fine-tuned biomedical language
representation models, improving patient care and management. For such
specialized domains, previous research has shown that models stemming from
broad-coverage checkpoints can benefit greatly from additional training
rounds over large-scale in-domain resources. However, these resources
are often unreachable for less-resourced languages like Italian, preventing
local medical institutions from employing in-domain adaptation. In order to reduce
this gap, our work investigates two accessible approaches to derive biomedical
language models in languages other than English, taking Italian as a concrete
use-case: one based on neural machine translation of English resources,
favoring quantity over quality; the other based on a high-grade, narrow-scoped
corpus natively written in Italian, thus preferring quality over quantity. Our
study shows that data quantity is a harder constraint than data quality for
biomedical adaptation, but the concatenation of high-quality data can improve
model performance even when dealing with relatively size-limited corpora. The
models published from our investigations have the potential to unlock important
research opportunities for Italian hospitals and academia. Finally, the set of
lessons learned from the study constitutes valuable insights towards a solution
to build biomedical language models that are generalizable to other
less-resourced languages and different domain settings.
comment: 8 pages, 3 figures
♻ ☆ An Augmentation Strategy for Visually Rich Documents
Many business workflows require extracting important fields from form-like
documents (e.g. bank statements, bills of lading, purchase orders, etc.).
Recent techniques for automating this task work well only when trained with
large datasets. In this work we propose a novel data augmentation technique to
improve performance when training data is scarce, e.g. 10-250 documents. Our
technique, which we call FieldSwap, works by swapping out the key phrases of a
source field with the key phrases of a target field to generate new synthetic
examples of the target field for use in training. We demonstrate that this
approach can yield 1-7 F1 point improvements in extraction performance.
comment: 9 pages, 6 figures, 3 tables
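The FieldSwap operation described in this abstract is concrete enough to sketch directly. The document schema, field names, and phrases below are our own illustrations, not the paper's data format:

```python
def field_swap(example, source_field, target_field):
    """FieldSwap-style augmentation: replace the key phrase of a source
    field with that of a target field, yielding a new synthetic training
    example labelled for the target field."""
    key_src = example["keys"][source_field]
    key_tgt = example["keys"][target_field]
    return {
        "text": example["text"].replace(key_src, key_tgt),
        "labels": {key_tgt: example["labels"][key_src]},
    }

doc = {
    "text": "Invoice Date: 2021-04-01",
    "keys": {"invoice_date": "Invoice Date", "due_date": "Due Date"},
    "labels": {"Invoice Date": "2021-04-01"},
}
aug = field_swap(doc, "invoice_date", "due_date")
print(aug["text"])  # Due Date: 2021-04-01
```

Because the swapped example keeps a realistic layout and value while changing only the key phrase, it teaches the extractor the association between a field's trigger text and its value even when only a handful of real documents exist.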
♻ ☆ Implementation of general formal translators
A general translator formalism and computation-specific implementations are
proposed. The implementation of the specific elements necessary to process
the source and destination information within the translators is presented.
Some common directives or instructions, such as classes and procedures, were
unified and generalized in order to allow general translation
implementations. To cover general cases, two levels of processing are
required, relating to the appropriate transformations of the source and
destination information, together with the associated control and processing
instructions. The proposed general translator elements are useful for
processing natural or artificial information described in any type of
language or system.
♻ ☆ Speech Synthesis with Mixed Emotions
Emotional speech synthesis aims to synthesize human voices with various
emotional effects. The current studies are mostly focused on imitating an
averaged style belonging to a specific emotion type. In this paper, we seek to
generate speech with a mixture of emotions at run-time. We propose a novel
formulation that measures the relative difference between the speech samples of
different emotions. We then incorporate our formulation into a
sequence-to-sequence emotional text-to-speech framework. During training,
the framework not only explicitly characterizes emotion styles, but also
explores the ordinal nature of emotions by quantifying the differences with
other emotions. At run-time, we control the model to produce the desired
emotion mixture by manually defining an emotion attribute vector. The objective
and subjective evaluations have validated the effectiveness of the proposed
framework. To the best of our knowledge, this research is the first study on
modelling, synthesizing, and evaluating mixed emotions in speech.
comment: Accepted to IEEE Transactions on Affective Computing
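The run-time control via a manually defined emotion attribute vector, as described in this abstract, can be sketched as a weighted blend. The per-emotion style embeddings here are hand-written toys; in the framework they are learned representations:

```python
def mix_emotions(style_embeddings, attribute_vector):
    """Blend per-emotion style embeddings using a manually defined
    attribute vector of mixing weights that sums to one."""
    assert abs(sum(attribute_vector.values()) - 1.0) < 1e-9
    dim = len(next(iter(style_embeddings.values())))
    mixed = [0.0] * dim
    for emotion, weight in attribute_vector.items():
        for i, v in enumerate(style_embeddings[emotion]):
            mixed[i] += weight * v
    return mixed

styles = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
print(mix_emotions(styles, {"happy": 0.75, "sad": 0.25}))  # [0.75, 0.25]
```

Adjusting the weights moves the synthesized style continuously between emotions, which is what exploiting the ordinal nature of emotions buys at inference time.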
♻ ☆ ConvNeXt Based Neural Network for Audio Anti-Spoofing
With the rapid development of speech conversion and speech synthesis
algorithms, automatic speaker verification (ASV) systems are vulnerable to
spoofing attacks. In recent years, researchers have proposed a number of
anti-spoofing methods based on hand-crafted features. However, using
hand-crafted features rather than the raw waveform loses implicit information
useful for anti-spoofing. Inspired by the promising performance of ConvNeXt in image
classification tasks, we revise the ConvNeXt network architecture and propose a
lightweight end-to-end anti-spoofing model. By integrating with the channel
attention block and using the focal loss function, the proposed model can focus
on the most informative sub-bands of speech representations and the difficult
samples that are hard to classify. Experiments show that our proposed system
achieves an equal error rate of 0.64% and a min-tDCF of 0.0187 on the
ASVSpoof 2019 LA evaluation dataset, which outperforms the state-of-the-art
systems.
comment: 6 pages
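The focal loss used in this abstract to focus training on hard samples has a standard binary form, sketched below. The gamma value is the common default, not necessarily the paper's setting:

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: down-weights easy examples via the (1 - pt)^gamma
    factor so training focuses on hard-to-classify samples. p is the
    predicted probability of the positive (e.g. spoofed) class, y the label."""
    pt = p if y == 1 else 1.0 - p
    return -((1.0 - pt) ** gamma) * math.log(pt)

# a confidently correct (easy) example contributes far less loss
print(focal_loss(0.95, 1) < focal_loss(0.30, 1))  # True
```

With gamma = 0 this reduces to ordinary cross-entropy; increasing gamma suppresses the gradient contribution of well-classified utterances, leaving the difficult spoofed/bona fide boundary cases to dominate training.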
♻ ☆ An Efficient Drug-Drug Interactions Prediction Technology for Molecularly Intelligent Manufacturing
Drug-Drug Interactions (DDIs) prediction is an essential issue in the
molecular field. Traditional methods of observing DDIs in medical experiments
require plenty of resources and labor. In this paper, we present a
computational model dubbed MedKGQA based on Graph Neural Networks to
automatically predict the DDIs after reading multiple medical documents in the
form of multi-hop machine reading comprehension. We introduced a knowledge
fusion system to obtain the complete nature of drugs and proteins and exploited
a graph reasoning system to infer the drugs and proteins contained in the
documents. Our model significantly improves performance compared to previous
state-of-the-art models on the QANGAROO MedHop dataset, obtaining a 4.5%
improvement in DDI prediction accuracy.